500 metagenome-assembled microbial genomes from 30 subtropical estuaries in South China

As unique geographical transition zones, estuaries are considered model environments for deciphering the diversity, functions and ecological processes of microbial communities, which play important roles in global biogeochemical cycles. Here we used surface water metagenomic sequencing datasets to construct metagenome-assembled genomes (MAGs) from 30 subtropical estuaries spanning a large geographic range in South China. In total, 500 dereplicated MAGs with completeness ≥ 50% and contamination ≤ 10% were obtained, of which more than one-third (n = 207 MAGs) have a completeness ≥ 70%. These MAGs are dominated by taxa assigned to the phyla Proteobacteria (n = 182 MAGs), Bacteroidota (n = 110) and Actinobacteriota (n = 104). These draft genomes can be used to study the diversity, phylogenetic history and metabolic potential of estuarine microbiota, which should improve our understanding of the structure and function of these microorganisms and of how they evolved and adapted to extreme conditions in the estuarine ecosystem.

The draft genomes were classified as 491 bacteria and 9 archaea. The vast majority of them belong to the phyla Proteobacteria (36.4%), Bacteroidota (22%) and Actinobacteriota (20.8%) (Table 1; Fig. 1). However, only 62 (12.4%) could be assigned to currently known taxa at the species level, with the remaining 438 (87.6%) representing currently uncultured species. To support full use of the genome data, quality control statistics for the raw metagenomic reads are provided in Supplementary Table S1. Assembly information is provided in Supplementary Table S2. The predicted taxon for each MAG, as well as bin statistics (e.g., completeness, contamination, size and N50), is provided in Supplementary Table S3. MAG abundance in each estuary is provided in Supplementary Table S4, and the associated environmental variables are given in Supplementary Table S5.
To the best of our knowledge, this is the largest number of microbial genomes from the largest number of estuaries reported in a single study. The dataset should facilitate future studies of the structure and function of these microorganisms and of how they evolved and adapted to the extreme conditions of estuarine ecosystems.
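The proportions quoted above follow directly from the MAG counts. As a quick consistency check, the short sketch below recomputes them from the counts given in the text; the variable and function names are illustrative and are not part of the original workflow.

```python
# Counts taken from the text; names are illustrative only.
TOTAL_MAGS = 500
domain_counts = {"Bacteria": 491, "Archaea": 9}
phylum_counts = {"Proteobacteria": 182, "Bacteroidota": 110, "Actinobacteriota": 104}


def percent(count, total=TOTAL_MAGS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Share of the catalog assigned to each dominant phylum.
phylum_percent = {name: percent(n) for name, n in phylum_counts.items()}

# Share of MAGs with and without a species-level assignment.
species_known = percent(62)
species_novel = percent(438)

for name, value in phylum_percent.items():
    print(f"{name}: {value}%")
print(f"Species-level assignment: {species_known}%, unassigned: {species_novel}%")
```

The domain counts sum to the catalog size (491 + 9 = 500), as do the species-level classes (62 + 438 = 500).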

Methods
Sample sites and sample collection. A total of 90 surface water samples were collected in December 2018 from 30 sites spanning the estuaries of 30 main rivers in South China, a range of ~1300 km (Fig. 2). At each estuary, triplicate samples were collected approximately 30-50 m apart. For metagenome sequencing, 500 mL of water was filtered through 0.22-μm pore polycarbonate membranes (Millipore Corporation, Billerica, MA, USA), as most prokaryotes are larger than this pore size. Filtration was performed within 4-8 h, and the filter membranes were quick-frozen in liquid nitrogen and then stored at −80 °C until DNA extraction.
DNA extraction, metagenomic sequencing and assembly. Total [...] FastTree v2.1.10 to infer an initial guide tree. Both trees contain non-parametric bootstrap support values. The tree was viewed and annotated using iTOL 24 (https://itol.embl.de).

Data records
The raw sequence data are available in the NCBI Sequence Read Archive (PRJNA730330) 25 . The 500 MAGs and the genome trees are available on figshare 26 . They have been appropriately specified in the text where required.

Technical Validation
To validate the completeness and contamination of the genomes, we assessed the number of marker genes present in all MAGs using CheckM v1.1.3 27 (checkm lineage_wf --tab_table -g -x faa -e 1e-10 -l 0.7). CheckM provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage. Completeness and contamination scores are estimated from the presence and copy number of these single-copy marker genes in the draft genome. An uncontaminated and complete MAG will have each of these marker genes present exactly once. The final catalog comprises only those genomes that met specific quality thresholds (i.e., completeness ≥ 50% and contamination < 10%) as described in the manuscript. Additionally, to improve quality (i.e., to increase completeness and reduce contamination), the bins were reassembled in metaWRAP.
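To illustrate the idea behind marker-based quality estimates, the following minimal sketch computes completeness and contamination from single-copy marker hits and applies the thresholds stated above. It is a deliberate simplification and not CheckM's implementation: it treats every marker independently and ignores the collocated marker sets that CheckM uses. The function names, marker identifiers and example data are hypothetical.

```python
from collections import Counter


def estimate_quality(marker_hits, marker_set):
    """Simplified completeness and contamination (both in %) for one genome.

    marker_hits: iterable of marker IDs detected in the genome, one per hit.
    marker_set: set of marker IDs expected exactly once in the lineage.
    Assumption: markers are weighted equally (no collocated sets).
    """
    counts = Counter(hit for hit in marker_hits if hit in marker_set)
    n_markers = len(marker_set)
    # Completeness: fraction of expected markers found at least once.
    present = sum(1 for marker in marker_set if counts[marker] >= 1)
    # Contamination: extra copies beyond the first, relative to the marker set size.
    extra = sum(counts[marker] - 1 for marker in marker_set if counts[marker] > 1)
    return 100.0 * present / n_markers, 100.0 * extra / n_markers


def passes_threshold(completeness, contamination, min_comp=50.0, max_cont=10.0):
    """Apply the catalog thresholds: completeness >= 50% and contamination < 10%."""
    return completeness >= min_comp and contamination < max_cont


# Hypothetical toy genome: four expected markers, one missing, one duplicated.
markers = {"m1", "m2", "m3", "m4"}
hits = ["m1", "m2", "m2", "m3"]
comp, cont = estimate_quality(hits, markers)
print(comp, cont, passes_threshold(comp, cont))  # 75.0 25.0 False
```

In this toy case three of the four markers are found (75% completeness) and one marker has an extra copy (25% contamination), so the genome would be excluded from the catalog.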

Code availability
Custom scripts were not used to generate or process this dataset. Software versions and non-default parameters used have been appropriately specified where required.